-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORM: Use pydantic to specify a schema for each ORM entity #6255
base: main
Are you sure you want to change the base?
Conversation
3848afd
to
c253259
Compare
c253259
to
fc52f41
Compare
49379d9
to
d1659b4
Compare
a7a6065
to
069bd1f
Compare
655fcf0
to
d778da8
Compare
07f3001
to
054c43c
Compare
054c43c
to
1c65856
Compare
7c7ef64
to
1e7f536
Compare
@sphuber just wondering, have you checked how this affects performance? |
This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255
def serialize_computer(self, computer: Computer, _info): | ||
return computer.label | ||
|
||
# @model_validator(mode='after') # type: ignore[misc] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like some code that was forgotten
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking out this PR @agoscinski . For some background info: I commented this out on purpose for now. Ideally, this check should be there because when storing the code, this condition should be satisfied. Unfortunately, however, this now also prevents just instantiating this model that validates the check, even if we don't plan on storing it. This is problematic for the export behavior, because that would rely on instantiating a Model
that violates this uniqueness rules, even though in this case it doesn't apply.
The old CLI verdi code create core.code.installed
would also rely on something like this check to immediately return an error as soon as label
was specified in interactive mode and that label already existed for the same computer. It would then reprompt the user asking for a different label. I haven't been able to reimplement this feature in this PR so was keeping the comment as a reminder.
Safe to say that this PR is not completely done yet 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still "in progress"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber I'm looking at this now. Trying to probe the issue you mention in this thread. I did the following:
- Create a
hello_world@localhost
installed code using verdi - Load the code and serialize it
- Create an
InstalledCode
instance from 2
I expected an issue here, asfrom_serialized
leverages the model, so the validator should've been triggered, raising an exception, as the label + computer already exist! No exception though.
As for your comment regarding the CLI, are you saying a similar validation should be implemented there? Sounds like it, but not in this PR, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summarizing a recent meeting with @sphuber:
- The currently commented out validator on the
Code.Model
is intended as a check on the uniqueness ofCode
label
+computer.label
tuples - However, one use of the pydantic
Model
of a code, the dynamic creation of the interactive options ofverdi code create...
, requires theModel
only for access to the available properties. Unfortunately, the validator will fire on instantiation from the serialized code, which will conflict with the uniqueness constraint, making it impossible to use theModel
in this case - The move to pydantic
Model
s for this CLI application regresses the user experience in the sense that it does not fire the validator until after the user finishes the interactive part, even though the tuple is already available within the first three options (note that this is not tied to this PR, as the use of pydanticModel
s for this case was implemented in an earlier PR)
The CLI matter is not blocking for this PR, as I understand it. A solution can be explored in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber one thing that is still unclear to us is how in the CLI application a Model is created from an existing Code. Are we not discussing the creation of codes? How is there already a code?
It doesn't create a Model from an existing code. What the CLI code does is directly construct a Model from the values passed by the user to the CLI command. And then uses InstalledCode.from_model(model)
to instantiate the InstalledCode
instance which it then will store. The relevant line in the code:
https://github.com/sphuber/aiida-core/blob/d4c4d94851836e58e743d8f1a2d30c665af2ae3b/src/aiida/cmdline/commands/cmd_code.py#L35
So in that particular use case, the CLI implementation for creating codes, the fact that instantiating InstalledCode.Model
for a label
and computer
that already exist raises a validation error is actually exactly what we want. Because we intend to then store an actual InstalledCode
instance from that model directly after. The problem only comes in that now we can never instantiate this Model even if we never intend to store it as another ORM instance. The main example is when we want to serialize an existing ORM instance through the Model as an intermediate, as we have discussed in the comments above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sphuber. Two things:
- Regarding unintended effects of the
serialize_computer
guard, I would argue that any situation prior to the future merging of this PR that required the label of the computer was a situation in which the computer was indeed an instance ofComputer
and thus would not be affected by this change. The only cases wherecomputer
is astr
should be the serialization case, no? - Regarding the CLI case...
The problem only comes in that now we can never instantiate this Model even if we never intend to store it as another ORM instance. The main example is when we want to serialize an existing ORM instance through the Model as an intermediate, as we have discussed in the comments above.
Are you saying there IS a problem still, or are you simply reiterating the original problem? Because it seems to me that the use ofmodel_construct
is precisely the solution. Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding unintended effects of the serialize_computer guard, I would argue that any situation prior to the future merging of this PR that required the label of the computer was a situation in which the computer was indeed an instance of Computer and thus would not be affected by this change. The only cases where computer is a str should be the serialization case, no?
I am not worried about existing code, since the serialization is completely new. But I was just saying that with adding this workaround, although it fixes the immediate problem with using model_construct
(which is itself already a bit of a workaround), I cannot guarantee that this will not lead to other problems/inconsistencies in the future. But perhaps this is still the best solution for now.
Are you saying there IS a problem still, or are you simply reiterating the original problem? Because it seems to me that the use of model_construct is precisely the solution. Or am I missing something?
No I was simply answering your question and providing more explanation. Using model_construct
in the serialization code does seem to workaround the problem for now (in combination with the guard in the computer serializer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Thanks for clarifying this matter 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think maybe there's another potentially critical issue here.
I tried reconstructing an installed code instance. With the validation active on the model, I first need to make the above change (and related changes) to Entity.to_model()
. This allows me to serialize the code. However, deserializing it using orm.InstalledCode.from_serialized(**code.serialize())
also raises the validation error. But here we shouldn't assume pre-validated input, so we can't use the model_construct
solution in from_serialized
.
I'm not sure how to get around this matter and allow the model validator. A REST API receiving a a serialized Code
instance cannot assume the data is valid. In the case that the data came from the same AiiDA instance with which the REST API is linked, the validation will surely object.
@sphuber am I overlooking something?
update
Thinking on this some more, "real" use cases of REST API node transfer will likely be between AiiDA instances, so the collision of the same full code label should not happen? But even if this is true, a specialized warning would be better. Not sure where to implement it though.
1e7f536
to
66f6096
Compare
@@ -132,6 +132,17 @@ requires-python = '>=3.9' | |||
'process.workflow.workchain' = 'aiida.orm.nodes.process.workflow.workchain:WorkChainNode' | |||
'process.workflow.workfunction' = 'aiida.orm.nodes.process.workflow.workfunction:WorkFunctionNode' | |||
|
|||
[project.entry-points.'aiida.orm'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really necessary to make orm pluggable, what is the use case of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber correct me if I'm wrong, but this has to do with the use of orm_class
, right? Under the implementation section of the AEP, it states that orm_class
is a shortcut to model_to_orm
used to define a model field that points to an ORM instance. So instead of
class Model(...):
user = MetadataField(
...,
model_to_orm=lambda pk: User.collection.get(id=pk),
)
you can do
class Model(...):
user = MetadataField(
...,
orm_class=User,
)
The various ORM classes that may be used in orm_class
are then defined in project.entry-points.'aiida.orm'
, so that you can do in Entity.model_to_orm_field_values(...)
:
if orm_class := get_metadata(field, 'orm_class'):
if isinstance(orm_class, str):
orm_class = BaseFactory('aiida.orm', orm_class)
fields[key] = orm_class.collection.get(id=field_value)
But then I guess my question is, do these not already have entry points? Or is the problem that you can't fetch them all generically using the BaseFactory
?
a7c6c13
to
9816200
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6255 +/- ##
==========================================
+ Coverage 77.51% 77.97% +0.47%
==========================================
Files 560 566 +6
Lines 41444 42279 +835
==========================================
+ Hits 32120 32964 +844
+ Misses 9324 9315 -9 ☔ View full report in Codecov by Sentry. |
It would be nice to have some benchmarks, |
This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255
* Introducing the kind checks. In principle, we can also not define the kinds, in that case it is up to the plugin to use the get_kinds() method of the StructureData to define kinds. * Starting to migrate to mkdocs * StructureData developer guide, missing: PropertyCollector and properties. * First profiling for the atomistic vs orm StructureData generation. Need more profiling for low-atoms systems, where it seems orm is better * Comment on profiling * Additional notes on profilings * Adding a last scaling test, to be commented and inserted in the github issue * Adding the automatic kind generation. Missing: deactivate this automatism. * Adding the `allow_kinds` input in the StructureData init. used to deactivate the automatic kind generation. * Adding the `to_legacy_structuredata` method in atomistc.StructureData This is provided in temporary way in order to make easier to migrate the plugin gradually to this new Data. In a first step, we should use this method anytime the plugin uses the StructureData. Then add the properties setting. * - Adding the developer docs - Adding some bugfixing for the empty StructureData init. * Added user guide with examples * Fixing #16 * This solves issue #10 This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255 * New implementation, following the pydantic PR. adding the `charges` properties, in the same way as we have the cell and pbc: hard-coding the getter and setter methods. For now, no data validation is there. * Added the support to read charges also from ASE Atoms object. Added the map_kinds method, which return the mapping for the kinds as contained in the sites attribute. * Reverting back to non-pydantic implementation (Just removing the Model attribute from the StructureData class, nothing else changed) Added also the charges in the get_ase method. * Attaching the properties to the Sites. - code refactored: now `StructureData` and `Site` classes are in separated files, as well as the utils; `Kind` class is removed - `Site` class now contains the properties: `symbol`, `position`, `mass`, `charge`, `magnetization`, `kind_name`; `kind_name` should be removed. The properties are defined via the `property` decorator - `from_ase`, `to_ase` methods are added - `from_pymatgen`, `to_pymatgen` methods are added (**TOBE fixed to add also the magnetization**) - `from_file`, `to_file` methods are added. They rely on ASE `io.read` and `io.write` functions. - the `orm.StructureData` (the old one) now has a method `to_atomistic` - the `utils.py` file contains also the `to_kinds` and `get_kinds` functions, to automatically generate kinds. The user should only use the `get_kinds` function. **TOBE refined: kind names, and remove the other kinds methods from the StructureData class.** - the `StructureData` constructor now works in this way: we provide the pbc, cell. Then we should use the `append_atom` method afterwards. **TOBE discussed, I guess we cann add lists contaning site-wise value for properties** - slicing of structure data: defined the __getitem__ method, to obtain a sliced structuredata instance from the initial one. - TOBE added: `to_supercell` method. * adding the possiblity to provide lists in the constructor BUT we are gonna change this by providing the possibility to provide list of sites (each of them is a dictionary), in the same format as they are represented in the database * Moving the get_kinds and _to_kinds (now a private method) into the StructureData class. * Removed kind related methods which now are no more needed in the StructureData * Utilities for cell and sites update * New StructureData and StructureDataMutable both inheriting from StructureDataCore. Added properties: - charge - magmom Relevant added methods: - from/to ASE/Pymatgen - to_legacy - to_structuredata - to_mutable_structuredata Added single page for API tutorial. Added tests for basic functionalities of both Structure Classes. * Allowing also magmom to be provided as floats: they are then stored as [magmom,0,0] * Improving mutability and minor changes. Mutability is improved by removing all the setter methods, both in StructureDataCore and Site classes. Moreover, lists and tuples are returned as np.array with flags.writeable=False, meaning that we cannot even modify the internal arrays/lists. Other minor changes: - Now in StructureDataCore we have to define the pbc, cell and sites keywords to initialise it (and so the other subclasses), this is helpful to then document the inputs. - `get_kinds` routine fixed to work also with magmoms (arrays) - docu and tests updated with respect to these changes. * Adding type hinting and minor fixes removing mutable attribute in Site class adding a get_global_properties method in StructureDataCore, not used now but in the future to have a list of the properties in the database (not only attached to the single sites), to make easy to query wrt a property. * Improving flexibility, automatic get_kinds and pymatgen fixings. Providing full flexibility in the StructureDataMutable while preserving same data structure of StructureData. - adding the parent and index when we initialise the Site class (i.e. when we access structure.sites). This is helpful to be able to change directly properties in the StructureDataMutable class. - in the same direction, we implemented the ObservedArray class which basically allows to trigger the setter methods of arrays/lists even if we modify only one element (for example: structure.sites[0].position[0] = 5). In this way, changes are written into the structure._data, which represents the real data stored in the instance. `get_kinds` method now is stable. Fixing pymatgen problems with oxi_state and magmom. * Adding in the tutorial markdown also the get_kinds method, and minimal update of the pytests.
9816200
to
d4c4d94
Compare
prefixes = ('aiida.orm.nodes.',) | ||
prefixes = ('aiida.orm.nodes.', 'aiida.orm.core.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to the changes of adding the aiida.orm
entry point group and making it possible to load the ORM classes through an entry point. But I will have to go back to the PR and actually test to see if this is still necessary
@@ -375,7 +377,7 @@ def set_file(self, file, filename=None): | |||
from aiida.common.exceptions import ParsingError | |||
from aiida.common.files import md5_file, md5_from_filelike | |||
|
|||
parsed_data = parse_upf(file) | |||
parsed_data = parse_upf(file, check_filename=self.CHECK_FILENAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber do you remember why you added this CHECK_FILENAME
class attribute?
Thanks for the review @edan-bainglass . I see that most your comments are questions. This is fair enough the PR doesn't provide any description of the changes and the reasoning behind it. Since most of your questions all refer to the same concept and so would be answered by the same answer, instead of going through them one by one, I will answer it in general here. The whole point of this PR is to provide a way that any ORM class can automatically be serialized to JSON and the same instance can be recovered from that serialized JSON payload. This is implemented in the Instead of manually implementing JSON (de)serializing for all classes, we use For all of this to work, we need to be able to clean map an instance of a I have been thinking whether those changes should be in separate commits. If we take the example of the The changes to the entry point handling, and adding |
@sphuber first, thank you so much for the detailed explanation of the PR. The scope/depth of change + uncharacteristic lack of explanation was a recipe for confusion. Note that at the time of you writing the above, I was only "half"-way through my review, as I had started it quite late in the day. I'm wrapping up the review at the moment, but with the above clarifications, the PR is now mostly crystal clear! And I think you can skip PR refactoring. It all makes sense now as a whole. Thanks for implementing this! It is pertinent to my current use case (query builder web app), which requires JSON (de)serialization of nodes via REST API calls. I believe you had mentioned before when we worked on the fields PR that this would precisely be where these changes would make an impact. Looking forward to this dropping! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber thanks again for implementing this. As mentioned in my other comment, it is mostly clear. Left a few comments after your post that I think would still benefit from clarification. Other than that, LGTM!
@edan-bainglass one reason why I hadn't (yet) written an extensive documentation in the commits here, is because I actually put this in an AEP. There you will find an extensive description of the goal, design, implementation and considerations. I also provide examples of how these changes make it then trivial to write a REST API around it. I provide a working example with FastAPI. As I mentioned in a comment elsewhere, I will quickly revisit this and see if the changes to the plugin loading system are still required for the REST API use case. If not, I will revert those. I would encourage yourself to also start testing this PR out with your envisioned use case to see what works and what doesn't and what might be missing. |
@khsrali you mentioned you had some test cases. Could you provide these benchmarks against your scripts? 🙏 |
To be more precise, I have a scriptic way to run performance tests for |
457440b
to
0f52b0d
Compare
0f52b0d
to
bdef464
Compare
@sphuber btw, if you recall, we had tested the possibility of avoiding having to explicitly extend the |
@sphuber I did a small test on an old environment where I had a single |
See corresponding AEP for details.